(adapted from Text Mining with R: a Tidy Approach by J. Silge and D. Robinson)
In this notebook, you will analyze the emotional content in Shakespeare's Macbeth.
Most humans have a good native understanding of the emotional intent of words, which leads us to infer surprise, disgust, joy, pain, and so forth, from a text segment.
The process, when applied by machines, is called sentiment analysis or opinion mining.
There are numerous challenges, as one can imagine. Chief among them: lexicon-based sentiment analysis requires dictionaries of emotional content to have been compiled (and validated) ahead of time.
In general, we can analyze the sentiment using the word-by-word method as follows:
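Boiled down, the recipe is: split the text into words, look each word up in a sentiment lexicon, and aggregate the per-word values. A toy illustration with an invented four-word lexicon (the real lexicons below are far larger):

```r
# word-by-word scoring with a hand-made lexicon
toy_lexicon <- c(good = 1, bad = -1, happy = 1, sad = -1)
text <- "the good the bad and the sad"
tokens <- strsplit(tolower(text), "\\s+")[[1]]
scores <- toy_lexicon[tokens]   # NA for words absent from the lexicon
sum(scores, na.rm = TRUE)       # overall sentiment: 1 - 1 - 1 = -1
```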
We will use the sentiment lexicons that are included with the tidytext package: AFINN, nrc, bing, loughran. (In recent tidytext releases, the AFINN, NRC, and Loughran lexicons are fetched through the textdata package on first use, and the AFINN score column is named value rather than score.)
library(tidytext)
head(sentiments) # get all the lexicons as one tibble
table(sentiments$lexicon) # see what lexicons are available
AFINN = get_sentiments("afinn") # words on a scale from -5 (negative) to 5 (positive)
BING = get_sentiments("bing") # binary negative/positive
NRC = get_sentiments("nrc") # assigns categories of sentiments (possibly more than one per term)
LOUGHRAN = get_sentiments("loughran")
Let's take a quick look at the 4 lexicons (they do not all contain the same number of observations).
str(AFINN)
str(BING)
str(NRC)
str(LOUGHRAN)
The sentiment categories (and distributions) can be accessed using basic R commands.
table(AFINN$score)
table(BING$sentiment)
table(NRC$sentiment)
table(LOUGHRAN$sentiment)
At first glance, it seems that there are more terms at the negative end of the spectrum. What kind of effect do you think that could have?
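We can put a rough number on that skew by computing the share of negative entries in each lexicon (this assumes the tibbles loaded above; note that AFINN is numeric, with a column named score here and value in newer tidytext releases, while the others are categorical):

```r
mean(AFINN$score < 0)                  # share of AFINN terms with a negative score
mean(BING$sentiment == "negative")     # share of negative terms in bing
mean(LOUGHRAN$sentiment == "negative") # share of negative terms in loughran
```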
How do the various lexicons grade specific words? Let's take a look at a few possibilities:
word = "abandon"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
word = "bad"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
word = "not"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
word = "cool"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
word = "egregious"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
word = "strike"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
COMMENTS:
bad is identified as a negative word and not is seen as neutral, but not bad would have to be scored as a positive 2-gram. There are ways to avoid these issues, but we will only focus on unigrams in this workshop. We start by creating a custom stop-word list for the works of Shakespeare at the Gutenberg Project, to strip out the Project Gutenberg header terms.
word = c("etext", "copyright", "implications", "electronic", "version", "william", "shakespeare", "inc", "gutenberg", "machine", "distributed", "commercially", "commercial", "distribution", "download", "shareware")
lexicon = rep("custom", length(word)) # safer than hard-coding the count
custom = data.frame(word, lexicon)
stop_words_custom_gut = rbind(stop_words, custom)
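An aside on the "not bad" problem flagged above: one common workaround is to tokenize into bigrams and flip the polarity of any word preceded by a negation. A toy sketch with an invented two-word lexicon (the same idea works with AFINN scores):

```r
# flip the unigram score when the preceding word is a negation
lex <- c(bad = -3, good = 3)
score_bigram <- function(w1, w2) {
  s <- unname(lex[w2])               # unigram score of the second word (NA if absent)
  if (!is.na(s) && w1 == "not") -s else s
}
score_bigram("not", "bad")   # +3: "not bad" now reads as positive
score_bigram("very", "bad")  # -3: unchanged when no negation precedes
```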
Now, prepare a tidy dataset for Slick Willy.
library(gutenbergr)
library(dplyr)
library(stringr) # necessary to use str_detect, str_extract
will_shakespeare <- gutenberg_download(c(1112,1524,2264,2242,2267,1120,1128,2243,23042,1526,1107,2253,1121,1103,2240,2268,1535,1126,1539,23046,1106,2251,2250,1790,2246,1114,1108,2262,1109,1537))
tidy_ws <- will_shakespeare %>%
unnest_tokens(word,text) %>%
mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
anti_join(stop_words_custom_gut) %>% # remove stop words and the Project Gutenberg header terms
na.omit() # remove NAs
Now, let's extract the surprise words from A Midsummer Night's Dream (according to the NRC lexicon):
nrc_surprise <- NRC %>%
filter(sentiment == "surprise") # to only keep the surprise terms
tidy_ws %>%
filter(gutenberg_id == 2242) %>% # gutenberg ID for MND
inner_join(nrc_surprise) %>%
count(word, sort = TRUE)
What do you think?
For comparison's sake, let's also look at anger words.
nrc_anger <- NRC %>%
filter(sentiment == "anger") # to only keep the anger terms
tidy_ws %>%
filter(gutenberg_id == 2242) %>% # gutenberg ID for MND
inner_join(nrc_anger) %>%
count(word, sort = TRUE)
library(tidyr) # to be able to use the spread functionality
library(dplyr)
macbeth = read.csv("Data/Macbeth.csv",header=TRUE, sep=",", stringsAsFactors=FALSE)
str(macbeth)
The Act and Scene variables can be combined to provide an increasing identifier for the play's sections.
Ultimately, we only want to keep information on the text, the line number, and the section.
macbeth$section = macbeth$Act*10 + macbeth$Scene # works as long as every act has fewer than 10 scenes
table(macbeth$section)
macbeth <- macbeth %>% select(c("Text","Play_Line","section"))
head(macbeth)
Now, let's unnest the tokens using word as a basic unit.
library(tidytext)
tidy_macbeth <- macbeth %>%
unnest_tokens(word, Text)
head(tidy_macbeth,25)
Then, we get a sentiment score for each word using the Bing lexicon (words that don't appear are considered to be neutral).
library(tidyr)
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) # we will use the bing lexicon to categorize words into negative and positive
head(macbeth_SA)
dim(macbeth_SA)
Next, we count the positive and negative words in each "section" of the play. The index splits the text into consecutive blocks of $L$ lines.
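The `%/%` operator is integer division, so lines 0 through 29 land in block 0, lines 30 through 59 in block 1, and so on:

```r
# integer division groups consecutive line numbers into blocks of 30
c(0, 29, 30, 59, 60) %/% 30   # 0 0 1 1 2
```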
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(index = Play_Line %/% 30, sentiment) # here we're using L = 30
head(macbeth_SA)
dim(macbeth_SA)
The counts are stored in the variable $n$. Let's reshape the tibble into a tidy dataset (reminder: each column hosts 1 variable, each row 1 observation).
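`spread()` pivots the long counts into one column per sentiment value; a minimal illustration on made-up data:

```r
library(tidyr)

# long format: one row per (index, sentiment) pair
long <- data.frame(index = c(0, 0, 1),
                   sentiment = c("negative", "positive", "negative"),
                   n = c(5, 3, 2))
spread(long, sentiment, n, fill = 0)
#   index negative positive
# 1     0        5        3
# 2     1        2        0
```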
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(index = Play_Line %/% 30, sentiment) %>%
spread(sentiment, n, fill = 0) # we'll get 2 sentiment columns: negative and positive
head(macbeth_SA)
dim(macbeth_SA)
Finally, let's compute the overall sentiment for each block of lines as the difference between its positive and negative term counts.
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(index = Play_Line %/% 30, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head(macbeth_SA)
And... well, that's it, really. It's fairly easy to plot the outcome.
library(ggplot2)
ggplot(macbeth_SA, aes(index, sentiment)) +
geom_col(show.legend = TRUE)
The overall picture seems to be somewhat negative -- but is that surprising? Macbeth is a tragedy, after all, arguably Shakespeare's darkest.
But perhaps what we're seeing is an artifact of the way we have blocked the play, or the length of the blocks, or even of the sentiment lexicon that we've elected to use. Let's look into this a little bit more.
Smaller number of (larger) blocks
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(index = Play_Line %/% 50, sentiment) %>% # use L = 50 instead
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(macbeth_SA, aes(index, sentiment)) +
geom_col(show.legend = TRUE)
Different sectioning mechanism (Act and Scene separation instead of arbitrary number of lines)
macbeth_SA <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(index = section, sentiment) %>% # section by Act and Scene instead of fixed-length blocks
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(macbeth_SA, aes(index, sentiment)) +
geom_col(show.legend = TRUE)
Different lexicons
afinn_macbeth <- tidy_macbeth %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = Play_Line %/% 30) %>%
summarise(sentiment = sum(score)) %>% # the AFINN categories are numerical, so sum instead of count
mutate(method = "AFINN")
bing_nrc_loughran_macbeth <- bind_rows(
tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "BING"),
tidy_macbeth %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative"))) %>% # because there are other sentiments in NRC
mutate(method = "NRC"),
tidy_macbeth %>%
inner_join(get_sentiments("loughran") %>%
filter(sentiment %in% c("positive","negative"))) %>% # because there are other sentiments in Loughran too
mutate(method = "LOUGHRAN")) %>%
count(method, index = Play_Line %/% 30, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn_macbeth,
bing_nrc_loughran_macbeth) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
So... what do you think? Is the evidence conclusive?
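One way to make the comparison less impressionistic is to correlate the per-block scores across methods, reusing the tibbles built above (a sketch, not a definitive diagnostic):

```r
library(dplyr)
library(tidyr)

# one column of per-block scores per method, one row per block
all_methods <- bind_rows(afinn_macbeth, bing_nrc_loughran_macbeth) %>%
  select(method, index, sentiment) %>%   # drop the raw positive/negative counts
  spread(method, sentiment)
cor(all_methods[, -1], use = "pairwise.complete.obs")
```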
We can also look at how often specific words contribute to positive and negative sentiments.
bing_word_counts <- tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
head(bing_word_counts)
tail(bing_word_counts)
And visualize them (bar charts, word clouds).
# bar charts
bing_word_counts %>%
group_by(sentiment) %>% # will create 2 graphs
top_n(10) %>% # pick only the top 10 in each category
ungroup() %>% # required to avoid a warning message below
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) + # plot a bar chart of word count
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") + # there will be 2 such bar charts, one for each sentiment
labs(y = "Contribution to sentiment",x = NULL) +
coord_flip() # horizontal bar charts
# wordcloud
library(wordcloud)
word = c("thou", "thy", "thee", "tis", "hath")
lexicon = rep("custom",5)
custom2 = data.frame(word,lexicon)
stop_words_custom_macbeth = rbind(stop_words,custom2)
tidy_macbeth %>%
anti_join(stop_words_custom_macbeth) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
# comparison cloud
library(reshape2)
tidy_macbeth %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>% # counting words for the whole play this time
acast(word ~ sentiment, value.var = "n", fill = 0) %>% # reshaping as a matrix with acast() for comparison cloud
comparison.cloud(colors = c("#660000", "#000066"),
max.words = 100)
Nothing jumps out at us as being amiss (which is no guarantee that there's no problem, but it's at least a good sign).
COMMENTS: